Performance optimization: basic GPU-Driven, Instance Culling, lighting optimization, etc. #138

Merged
YXHXianYu merged 18 commits into main from feat/gpu_culling
Mar 31, 2026

@YXHXianYu
Member

Next TODO:

Shadows

  • Instance Culling removes roofs that are outside the view, which causes shadow leaking

Code review results [AI]

Correctness bugs (recommended to fix before committing)

Shader instance culling logic is wrong: FrustumCull.comp.hlsl:103-125

When a primitive has multiple instances and only some of them are visible, the shader only sets instance_cnt to visible_count but does not reorder the instance buffer. DrawIndexedIndirect then draws the first visible_count instances starting at first_instance, not the instances that are actually visible.

Example: of 5 instances, instances 0, 2, and 4 are visible → visible_count = 3 → the GPU draws instances 0, 1, and 2 (instance 1 should not be drawn, and instance 4 is dropped).

If every primitive in your scene has only 1 instance, this bug is never triggered. With multi-instance primitives, however, it causes rendering errors.
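One possible way to fix the partial-visibility case is to write the surviving instance ids into a remap buffer while counting them. The block below is a minimal CPU-side sketch of that idea, not the code in FrustumCull.comp.hlsl; `CompactVisibleInstances`, the `remap` buffer, and the `visible` flags are hypothetical names introduced for illustration, and the struct only mirrors the field layout of `VkDrawIndexedIndirectCommand`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    uint32_t index_count;
    uint32_t instance_count;
    uint32_t first_index;
    int32_t  vertex_offset;
    uint32_t first_instance;
};

// Hypothetical helper: CPU model of per-primitive instance compaction.
// visible[i] is the frustum test result for instance (first_instance + i).
// remap stands in for an extra GPU buffer: slot k of the draw holds the id
// of the k-th visible instance, so the vertex shader can look up the real
// instance instead of assuming slots and instances coincide.
void CompactVisibleInstances(DrawIndexedIndirectCommand& cmd,
                             const std::vector<bool>& visible,
                             std::vector<uint32_t>& remap) {
    assert(visible.size() == cmd.instance_count);
    assert(remap.size() >= cmd.first_instance + visible.size());

    uint32_t visible_count = 0;
    for (uint32_t i = 0; i < visible.size(); ++i) {
        if (visible[i]) {
            // Pack the surviving instance id into the next free slot.
            remap[cmd.first_instance + visible_count] = cmd.first_instance + i;
            ++visible_count;
        }
    }
    // The draw now covers exactly the packed slots.
    cmd.instance_count = visible_count;
}
```

With the 0/2/4 example above, the first three slots of the draw would hold instances 0, 2, and 4 (offset by first_instance), so drawing the first visible_count slots touches only visible instances. On the GPU the slot would typically be reserved with an atomic add rather than a serial loop, which does not preserve order.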


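Why there is no performance gain

Three reasons:

| Issue | Impact |
| --- | --- |
| Full CPU→GPU upload every frame in RestoreDrawCommands() | High: the entire draw command buffer is re-uploaded from CPU memory every frame (staging buffer + DMA), and this cost may already exceed the culling gain |
| Uses vkCmdDrawIndexedIndirect instead of vkCmdDrawIndexedIndirectCount | High: the GPU command processor still walks every draw command, and a draw with instance_cnt=0 is not actually free |
| Two pipeline barriers | Medium: the barriers between compute and graphics force a pipeline flush |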

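Recommended optimization directions

  1. Replace the CPU→GPU upload with a GPU→GPU buffer copy: keep a pristine draw command buffer on the GPU and use CopyFrom(BufferView, BufferView) to copy it on the GPU side, so the staging buffer is no longer used every frame
  2. Use vkCmdDrawIndexedIndirectCount: the engine already implements this API (DrawIndexedIndirectCnt). The culling shader writes the visible draw count to a separate count buffer, so the GPU can skip culled draws entirely
  3. Stream compaction: have the culling shader pack the visible draws tightly and pair them with the count buffer. This is the standard approach in UE5/Nanite and can also resolve the instance correctness bug above

Do you want me to implement these optimizations?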
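The three directions above can be sketched together on the CPU. This is a simplified model under stated assumptions, not the engine's implementation: `CompactDraws` and its parameters are hypothetical, the pristine list stands in for the GPU-resident copy from item 1, and the returned value stands in for the count buffer that `vkCmdDrawIndexedIndirectCount` reads through its `countBuffer` argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    uint32_t index_count;
    uint32_t instance_count;
    uint32_t first_index;
    int32_t  vertex_offset;
    uint32_t first_instance;
};

// Hypothetical helper: CPU model of draw-level stream compaction.
// pristine is never modified, so nothing has to be restored per frame.
// visible_instances[i] is the number of instances of draw i that passed
// culling. Surviving draws are packed at the front of out, and the return
// value is what the culling pass would store in the count buffer.
uint32_t CompactDraws(const std::vector<DrawIndexedIndirectCommand>& pristine,
                      const std::vector<uint32_t>& visible_instances,
                      std::vector<DrawIndexedIndirectCommand>& out) {
    assert(pristine.size() == visible_instances.size());
    out.clear();
    for (std::size_t i = 0; i < pristine.size(); ++i) {
        if (visible_instances[i] == 0) {
            continue;  // Fully culled draw: it never reaches the GPU draw list.
        }
        DrawIndexedIndirectCommand cmd = pristine[i];
        cmd.instance_count = visible_instances[i];
        out.push_back(cmd);
    }
    return static_cast<uint32_t>(out.size());
}
```

In a compute shader each thread would reserve its output slot with an atomic add on the count buffer instead of `push_back`, and the draw call would pass the original draw count as `maxDrawCount`. Note that this sketch compacts at draw granularity only; a draw with partially visible instances still needs its instance data packed or remapped for the reduced instance_count to be correct.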

@YXHXianYu YXHXianYu changed the title from "Most basic GPU-Driven pipeline (Instance Culling, etc.)" to "Performance optimization: basic GPU-Driven, Instance Culling, lighting optimization, etc." Mar 29, 2026
@YXHXianYu
Member Author

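潇神's mini-assignment: optimizing ShadowMaskPass

  • Bindless buffer loads carry a register overhead, so this needs to switch back to CBV

Before optimization

image image

After optimization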

image

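name: RTAO Register Optimization

overview: Lower the RTAO shader's peak register usage, raise occupancy, and relieve MIOT stalls by reordering code, simplifying RayQuery control flow, and eliminating runtime branches.

todos:
  - id: simplify-rayquery
    content: "Optimization 1: Simplify CastVisibilityRay. Add RAY_FLAG_FORCE_OPAQUE and remove the while loop and the candidate-handling branch"
    status: completed
  - id: reorder-ray-weight
    content: "Optimization 2: Reorder the loop body. Move the ray_weight computation before CastVisibilityRay so that rand_vec dies earlier"
    status: completed
  - id: eliminate-branch
    content: "Optimization 3: Eliminate the sample_mode runtime branch. Use a compile-time alternative instead"
    status: completed
  - id: blue-noise
    content: "Optimization 4: Replace the hash RNG with blue noise. Add a texture parameter and improve the BVH cache hit rate"
    status: completed
…ositePass to blend AO and SceneColor, AO itself does not output SceneColor; optimize the AoPass data flow; remove SR-related code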
@YXHXianYu YXHXianYu marked this pull request as ready for review March 31, 2026 12:15
@YXHXianYu YXHXianYu merged commit 39956b8 into main Mar 31, 2026
@YXHXianYu YXHXianYu deleted the feat/gpu_culling branch March 31, 2026 12:16